Baseball is a great sport, and we are big fans of the game, and with the aid of Lahman’s Database, we are selecting 50 of our favorite baseball players and determining if height and weight affects how many homeruns are hit over the last 10 years of their career and their performance on fielding. Additionally, we seek to ascertain whether a player’s handedness significantly impacts their batting average across the same duration.
The datasets used in this analysis include the People, Batting, and Fielding datasets from Lahman’s Database, encompassing crucial variables such as playerID, nameFirst, nameLast, weight, height, bats, debut, finalGame from the People dataset, playerID, yearID, PO(Putouts), A(Assist), E(Error) from the Fielding dataset,and the playerID, AB(At Bats), H(Hits), and HR(Homerun) variables from the Batting dataset. To ensure a comprehensive and accurate analysis, the 10 seasons under consideration will be the 10 seasons leading up to each player’s final game season, since rookie seasons often lack sufficient data to draw meaningful conclusions.
Through this rigorous statistical investigation, we aim to uncover potential relationships between physical attributes and players’ batting and fielding performance, contributing to the ongoing discourse on the multifaceted dynamics shaping success in professional baseball.
IMPORT
#Read the 'People' Dataset and select the variables we need from the 'People' Dataset people <-subset(People, select =c(playerID, nameFirst, nameLast, weight, height, bats, debut, finalGame))tail(people, n=10)
playerID nameFirst nameLast weight height bats debut finalGame
21001 tallice99 Cedric Tallis NA NA <NA> <NA> <NA>
21002 bronfch99 Charles Bronfman NA NA <NA> <NA> <NA>
21003 somerch99 Charles Somers NA NA <NA> <NA> <NA>
21004 feenech99 Chub Feeney NA NA <NA> <NA> <NA>
21005 bushge99 George W. Bush NA NA <NA> <NA> <NA>
21006 paysojo99 Joan Payson NA NA <NA> <NA> <NA>
21007 galbrjo99 John Galbreath NA NA <NA> <NA> <NA>
21008 mcshejo99 John McSherry 351 75 <NA> <NA> <NA>
21009 weyerle99 Lee Weyer 258 78 <NA> <NA> <NA>
21010 palerst99 Steve Palermo 175 74 <NA> <NA> <NA>
#Read the 'Batting' Dataset and select the variables we need from the 'Batting' Dataset batting <-subset(Batting, select =c(playerID, yearID, AB, H, HR))tail(batting, n=10)
#Read the 'Fielding' Dataset and select the variables we need from the 'Fielding' Dataset fielding <-subset(Fielding, select =c(playerID, yearID, PO, A, E))tail(fielding, n=10)
playerID nameFirst nameLast weight height bats debut finalGame
3 beltrad01 Adrian Beltre 220 71 R 1998-06-24 2018-09-30
18 gonzaad01 Adrian Gonzalez 215 74 L 2004-04-18 2018-06-10
36 pujolal01 Albert Pujols 235 75 R 2001-04-02 2022-10-04
40 rodrial01 Alex Rodriguez 230 75 R 1994-07-08 2016-08-12
37 ramirar01 Aramis Ramirez 205 73 R 1998-05-26 2015-10-03
41 ruthba01 Babe Ruth 215 74 L 1914-07-11 1935-05-30
7 bondsba01 Barry Bonds 185 73 L 1986-05-30 2007-09-26
20 harpebr03 Bryce Harper 210 75 L 2012-04-28 2022-10-04
35 poseybu01 Buster Posey 213 73 R 2009-09-11 2021-10-03
13 crawfca02 Carl Crawford 230 74 L 2002-07-20 2016-06-03
50 yastrca01 Carl Yastrzemski 175 71 L 1961-04-11 1983-10-02
4 beltrca01 Carlos Beltran 215 73 B 1998-09-14 2017-10-01
25 jonesch06 Chipper Jones 210 76 B 1993-09-11 2012-10-03
34 ortizda01 David Ortiz 230 75 L 1997-09-02 2016-10-02
24 jeterde01 Derek Jeter 195 75 R 1995-05-29 2014-09-28
45 tatisfe01 Fernando Tatis 185 71 R 1997-07-26 2010-07-04
46 tatisfe02 Fernando Tatis 217 75 R 2019-03-28 2021-10-03
15 freemfr01 Freddie Freeman 220 77 L 2010-09-01 2022-10-05
8 brettge01 George Brett 185 72 L 1973-08-02 1993-10-03
1 aaronha01 Hank Aaron 180 72 R 1954-04-13 1976-10-03
27 killeha01 Harmon Killebrew 195 72 R 1954-06-23 1975-09-26
44 suzukic01 Ichiro Suzuki 175 71 L 2001-04-02 2019-03-21
39 robinja02 Jackie Robinson 195 71 R 1947-04-15 1956-09-30
30 mauerjo01 Joe Mauer 225 77 L 2004-04-05 2018-09-30
48 vottojo01 Joey Votto 220 74 L 2007-09-04 2022-08-16
2 altuvjo01 Jose Altuve 166 66 R 2011-07-20 2022-10-04
16 gehrilo01 Lou Gehrig 200 72 L 1923-06-15 1939-04-30
28 machama01 Manny Machado 218 75 R 2012-08-09 2022-10-05
38 ramirma02 Manny Ramirez 225 72 R 1993-09-02 2011-04-06
32 mcgwima01 Mark McGwire 215 77 R 1986-08-22 2001-10-07
22 hollima01 Matt Holliday 240 76 R 2004-04-16 2018-10-01
26 kempma01 Matt Kemp 225 76 R 2006-05-28 2020-09-25
29 mantlmi01 Mickey Mantle 195 71 B 1951-04-17 1968-09-28
9 cabremi01 Miguel Cabrera 267 76 R 2003-06-20 2022-10-04
42 schmimi01 Mike Schmidt 195 74 R 1972-09-12 1989-05-28
47 troutmi01 Mike Trout 235 74 R 2011-07-08 2022-10-05
5 bettsmo01 Mookie Betts 180 69 R 2014-06-29 2022-10-05
17 goldspa01 Paul Goldschmidt 220 75 R 2011-08-01 2022-10-04
14 fieldpr01 Prince Fielder 275 71 L 2005-06-13 2016-07-18
11 clemero01 Roberto Clemente 175 71 R 1955-04-17 1972-10-03
23 howarry01 Ryan Howard 250 76 L 2004-09-01 2016-10-02
43 sosasa01 Sammy Sosa 165 72 R 1989-06-16 2007-09-29
10 choosh01 Shin-Soo Choo 205 71 L 2005-04-21 2020-09-27
49 willite01 Ted Williams 205 75 L 1939-04-20 1960-09-28
21 heltoto01 Todd Helton 220 74 L 1997-08-02 2013-09-29
12 cobbty01 Ty Cobb 175 73 L 1905-08-30 1928-09-11
19 guerrvl01 Vladimir Guerrero 235 75 R 1996-09-19 2011-09-28
6 boggswa01 Wade Boggs 190 74 L 1982-04-10 1999-08-27
31 mayswi01 Willie Mays 170 70 R 1951-05-25 1973-09-09
33 molinya01 Yadier Molina 225 71 R 2004-06-03 2022-10-05
TRANSFORM
# Join players and batting dataset for and filter the last 10 seasons of the playersbattingData <-left_join(players, batting, by =join_by(playerID))|> dplyr::select(yearID, playerID, nameFirst, nameLast, weight, height, bats, AB, H, HR)|>group_by(playerID)|>filter(yearID >=max(yearID)-9)battingData
# A tibble: 501 × 10
# Groups: playerID [50]
yearID playerID nameFirst nameLast weight height bats AB H HR
<int> <chr> <chr> <chr> <int> <int> <fct> <int> <int> <int>
1 2009 beltrad01 Adrian Beltre 220 71 R 449 119 8
2 2010 beltrad01 Adrian Beltre 220 71 R 589 189 28
3 2011 beltrad01 Adrian Beltre 220 71 R 487 144 32
4 2012 beltrad01 Adrian Beltre 220 71 R 604 194 36
5 2013 beltrad01 Adrian Beltre 220 71 R 631 199 30
6 2014 beltrad01 Adrian Beltre 220 71 R 549 178 19
7 2015 beltrad01 Adrian Beltre 220 71 R 567 163 18
8 2016 beltrad01 Adrian Beltre 220 71 R 583 175 32
9 2017 beltrad01 Adrian Beltre 220 71 R 340 106 17
10 2018 beltrad01 Adrian Beltre 220 71 R 433 118 15
# ℹ 491 more rows
#Get the total homeruns of each playerbodyHR <- battingData|>summarize(sumHR =sum(HR))|>left_join(people) |> dplyr::select(playerID, nameFirst, nameLast, weight, height, sumHR)
#Get the batting average of each playerhandBatAvg <- battingData|>summarize(BatAvg =round(sum(H)/sum(AB), 3))|>left_join(people) |> dplyr::select(playerID, nameFirst, nameLast, bats, BatAvg)
Joining with `by = join_by(playerID)`
handBatAvg
# A tibble: 50 × 5
playerID nameFirst nameLast bats BatAvg
<chr> <chr> <chr> <fct> <dbl>
1 aaronha01 Hank Aaron R 0.286
2 altuvjo01 Jose Altuve R 0.313
3 beltrad01 Adrian Beltre R 0.303
4 beltrca01 Carlos Beltran B 0.278
5 bettsmo01 Mookie Betts R 0.294
6 boggswa01 Wade Boggs L 0.304
7 bondsba01 Barry Bonds L 0.314
8 brettge01 George Brett L 0.293
9 cabremi01 Miguel Cabrera R 0.285
10 choosh01 Shin-Soo Choo L 0.266
# ℹ 40 more rows
# Join players and fielding dataset for and filter the last 10 seasons of the playersfieldingData<-left_join(players, fielding, by =join_by(playerID))|> dplyr::select(yearID, playerID, nameFirst, nameLast, weight, height, PO, A, E)|>group_by(playerID)|>filter(yearID >=max(yearID)-9)fieldingData
# A tibble: 647 × 9
# Groups: playerID [50]
yearID playerID nameFirst nameLast weight height PO A E
<int> <chr> <chr> <chr> <int> <int> <int> <int> <int>
1 2009 beltrad01 Adrian Beltre 220 71 103 224 14
2 2010 beltrad01 Adrian Beltre 220 71 138 285 19
3 2011 beltrad01 Adrian Beltre 220 71 93 208 11
4 2012 beltrad01 Adrian Beltre 220 71 95 209 8
5 2013 beltrad01 Adrian Beltre 220 71 93 232 14
6 2014 beltrad01 Adrian Beltre 220 71 144 206 12
7 2015 beltrad01 Adrian Beltre 220 71 105 267 17
8 2016 beltrad01 Adrian Beltre 220 71 104 301 10
9 2017 beltrad01 Adrian Beltre 220 71 50 135 5
10 2018 beltrad01 Adrian Beltre 220 71 53 145 10
# ℹ 637 more rows
#Get the fielding percentage of each playerfieldingPercent <- fieldingData|>summarize(FieldingPercentage =round((sum(PO)+sum(A))/(sum(PO)+sum(A)+sum(E)), 3))|>left_join(people) |> dplyr::select(playerID, nameFirst, nameLast, weight, height, FieldingPercentage)
#Between Weight and HomerunsbodyHRweight <-ggplot(bodyHR, aes(x=sumHR, y=weight))+geom_point()+labs(title ="Comparison of Homeruns and Players' Weight",x ="Number of Home Runs",y="Players' Weight")+theme_minimal( )bodyHRweight
#Between Height and HomerunsbodyHRheight <-ggplot(bodyHR, aes(x=sumHR, y=height))+geom_point()+labs(title ="Comparison of Homeruns and Players' Height",x ="Number of Home Runs",y="Players' Height")+theme_minimal( )bodyHRheight
#Between Weight, Height, and HomerunsbodyHRboth <-ggplot(bodyHR, aes(x=height, y=weight, color = sumHR, size = sumHR))+geom_point(alpha=0.7)+labs(title ="Comparison of Homeruns and Players' Height and Weight",x ="Players' Height",y ="Players' Weight")+theme_minimal( )bodyHRboth
#3D version of bodyHRbothbodyHRboth3D <-plot_ly(x=bodyHR$height,y=bodyHR$weight,z=bodyHR$sumHR) |>layout(scene =list(xaxis =list(title ="Height"),yaxis =list(title ="Weight"),zaxis =list(title ="Total Number of Homeruns")))bodyHRboth3D
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Batting Average by Handedness
#Batting Average by HandednesshandedAvg <-ggplot(handBatAvg, aes(y=BatAvg, x= bats, fill=bats))+geom_boxplot()+labs(title ="Batting Average by Players' Handedness",x =NULL,y="Players' Career Batting Average")+theme_minimal( )handedAvg
#Number of right-handed, left-handed, and both-handed hitters in the datasetnumbats <-table(handBatAvg$bats)numbats
B L R
3 20 27
Impact of Weight and Height on Fielding Percentage
#Between Weight and Fielding PercentagebodyFPweight <-ggplot(fieldingPercent, aes(x=FieldingPercentage, y=weight))+geom_point()+labs(title ="Comparison of Fielding Percentage and Players' Weight",x ="Fielding Percentage",y="Players' Weight")+theme_minimal( )bodyFPweight
#Between Height and Fielding PercentagebodyFPheight <-ggplot(fieldingPercent, aes(x=FieldingPercentage, y=height))+geom_point()+labs(title ="Comparison of Fielding Percentage and Players' Height",x ="Fielding Percentage",y="Players' Height")+theme_minimal( )bodyFPheight
#Between Weight, Height, and HomerunsbodyFPboth <-ggplot(fieldingPercent, aes(x=height, y=weight, color = FieldingPercentage, size = FieldingPercentage))+geom_point(alpha=0.7)+labs(title ="Comparison of Fielding Percentage and Players' Height and Weight",x ="Players' Height",y ="Players' Weight")+theme_minimal( )bodyFPboth
#3D version of bodyFPbothbodyFPboth3D <-plot_ly(x=fieldingPercent$height,y=fieldingPercent$weight,z=fieldingPercent$FieldingPercentage) |>layout(scene =list(xaxis =list(title ="Height"),yaxis =list(title ="Weight"),zaxis =list(title ="FieldingPercentage")))bodyFPboth3D
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Communicate
Impact of Weight and Height on Homeruns
We have created insightful plots with GGplot and plot_ly to provide a visually compelling representation of our questions. Through these visualizations, we aim to find the relationship between weight, height, and the number of homeruns of the selected 50 baseball players during their careers.
We expected that taller and heavier hitters would record higher numbers of the sum of homeruns during their careers. However, our expectation was not met from the dataset.
It is evident from our analysis that the correlation between weight and height does not invariably dictate one’s ability to achieve a high homerun count; however, it is noteworthy to acknowledge that there may be instances where these factors play a role in influencing such outcomes. When we increase our samples, there could be a relationship between weight, height, and the number of homeruns. Since there are only 50 samples, we might not be able to find a relationship.
Batting Average by Handedness
From the boxplot we can see that the batting average of left-handed hitter is the highest. The mean of the batting average of left-handed is 0.299. And the batting average of right-handed hitter is the second highest. The mean of the batting average of right-handed is 0.290. For the lowest batting average, both-handed is lowest. The mean of the batting average of both-handed is 0.286.
We expected that left-handed’s batting average would be highest and both-handed’s batting average would be lowest. And indeed, our expectations were correct.
There are a lot of left-handed hitters these days, but since there have been a lot more right-handed hitters, left-handed hitters may be able to target pitchers better because pitchers have relatively little experience with left-handed hitters. And left-handed hitters are more likely to hit infield than right-handed hitters because the first base is closer.
A both-handed hitter can play with his right hand when his opponent is a left-handed pitcher, and with his left hand when he is a right-handed pitcher, but that doesn’t mean they are better than highly trained left-handed hitters, or highly trained right-handed hitters. In fact, as long as they don’t use both hands well with real talent, they’re likely to just be an ambiguous both-handed hitter.
Impact of Weight and Height on Fielding Percentage
We expected that shorter and heavier hitters would record lower fielding percentage during their careers. However, our expectation was not met from the dataset.
After creating plots to compare the effect of weight and height on fielding percentage, we can analyze the correlation to some degree. Based on these plots, there is no correlation.